Biomedical Information Extraction with Predicate-Argument Structure Patterns
Abstract
Due to the ever-growing number of publications, Information Extraction (IE) from text is increasingly recognized as one of the crucial technologies in bioinformatics. However, for IE to be practically applicable, adaptability/portability of a system is crucial, considering the extremely diverse demands of biomedical IE applications. We should be able to construct a set of “extraction rules” adapted for a specific application at low cost. We propose a new method for automatic construction of application-specific extraction rules, which effectively utilizes predicate-argument structures (PASs) produced by a full parser. By dividing labor between generic linguistic rules in the parser and application-specific extraction rules to be constructed from scratch, this method facilitates acquisition of extraction rules from a relatively small annotated corpus. We conducted an experiment in which the method was applied to extraction of protein-protein interactions. The result shows that, though the current version of the construction algorithm is straightforward, the performance is remarkably promising, comparable with that obtained by manually made extraction rules or by rules generalized by machine learning techniques.
Introduction
Although Information Extraction (IE) from text is increasingly recognized as a crucial component in bioinformatics, it has hardly been used yet in the process of actual data curation, knowledge integration/discovery, etc. This is because (1) [Quality] the performance of current IE in terms of recall and precision is not good enough, and/or (2) [Portability/Adaptability] current IE requires a lot of human effort to adapt to particular information needs in specific applications.
While IE systems whose extraction rules are carefully crafted and adapted for particular applications (such as [8, 3]) show near-practical performance, the same performance can hardly be reproduced in different applications that have to deal with different kinds of information (protein-protein interaction, disease-gene association, toxicity of materials, etc.). Manual engineering of IE systems is a tedious and time-consuming process. Techniques based on machine learning (ML) (such as [6, 2]) are expected to alleviate this difficulty of manually crafted IE. However, in most cases, they simply transfer the cost of manually crafting rules to that of constructing a large amount of training data, which in the case of IE requires the tedious manual labor of annotating text. Moreover, ML techniques applied directly to surface sequences of words in text, without further processing, have shown poor results.
In order to render IE techniques practical in biomedical domains, it is crucial that the generic part of a system, which can be transferred across IE systems in different applications, is clearly distinguished from the application-specific part, so that the cost of adaptation can be minimized. In this paper, we propose a new system architecture in which a full parser plays a significant role in improving the quality of performance as well as increasing the adaptability of an IE system. A full parser is a program that takes a sentence as input and produces its semantic representation (predicate-argument structure: PAS). While a full parser embodies linguistic knowledge that is valid across different applications, extraction rules that are application-dependent have to be constructed from scratch.
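As a rough illustration of this division of labor, the following minimal Python sketch shows how an active and a passive sentence could map to one PAS, and how a single application-specific rule could then match both. Everything in it is an assumption made for illustration: the `PAS` and `Rule` classes, the field names, the placeholder protein names, and the hand-written `PARSED` table that stands in for real parser output. It is not the parser or the rule format used in the paper.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class PAS:
    """Toy predicate-argument structure (hypothetical, simplified)."""
    predicate: str  # base form of the verb
    arg1: str       # logical subject
    arg2: str       # logical object


# Hand-written stand-in for full-parser output (assumption: a real
# parser would compute these structures from the sentences).
PARSED = {
    "ProteinA activates ProteinB.": PAS("activate", "ProteinA", "ProteinB"),
    "ProteinB is activated by ProteinA.": PAS("activate", "ProteinA", "ProteinB"),
}


@dataclass(frozen=True)
class Rule:
    """Toy application-specific extraction rule keyed on a predicate."""
    predicate: str

    def apply(self, pas: PAS, proteins: Set[str]) -> Optional[Tuple[str, str]]:
        # Fire only if the predicate matches and both arguments are
        # known protein names; otherwise extract nothing.
        if (pas.predicate == self.predicate
                and pas.arg1 in proteins
                and pas.arg2 in proteins):
            return (pas.arg1, pas.arg2)
        return None


proteins = {"ProteinA", "ProteinB"}
rule = Rule("activate")
results = [rule.apply(pas, proteins) for pas in PARSED.values()]
```

In this toy setting, one rule covers both surface forms because the variation between active and passive voice has already been absorbed by the (simulated) parser output.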
Because diverse forms of surface sentences with the same meaning are reduced into single PASs by a full parser, the construction algorithm for extraction rules is much simpler than those that treat sentences as mere sequences of words, and it can acquire rules from a much smaller training set. The rules thus constructed give a promising performance (37.3% precision and 45.3% recall without any manual intervention for IE of protein-protein interactions). Furthermore, although we do not discuss it in this paper, because the extraction rules thus acquired are easy to understand, one can revise and augment them manually and develop IE systems with performance comparable with (or better than) carefully crafted IE systems. This paper discusses details of the construction algorithms, the performance of an IE system and future development, after briefly discussing the full parser.
Previous Work
Biomedical interaction extraction from text is now attracting much research [4, 13, 12, 1, 14, 9, 8, 17, 3, 2, 6, 20]. These IE systems include a process that reduces diverse surface forms in text into a standard structure by natural language processing (NLP) and builds extraction rules on that structure. Some works use pattern matching [12, 1, 2, 6], while others use shallow parsing [14, 9, 8] or full parsing [21, 13, 4, 17, 3]. The works can also be categorized by how they construct extraction rules. One approach is based on hand-written rule sets [12, 4, 1, 14, 17, 8, 3]. The other is rule generation by ML based on a corpus annotated with the desired information [9, 2, 6]. Some recent works closely related to ours are as follows.
Daraselia et al. [3] used a full parser based on context-free grammar and a lexicon developed specifically for MEDLINE. They wrote extraction rules on semantic trees and extracted mammalian protein functional links with 91% precision and an estimated recall of 30–50%. Their extraction rules require much manual modification to apply to different kinds of information.
Bunescu et al. [2] used a machine learning technique to construct extraction rules on surface words, in the form of inter-fillers (text fragments between participating entities), role-fillers and longest common subsequences that represent protein-protein interactions. The corpus they used is Aimed, which consists of 230 MEDLINE abstracts annotated with protein names and protein-protein interactions. They reached about 48% precision at 45% recall. One of the shortcomings of the system is that the generated patterns are hard to augment manually to improve performance, because the patterns have no guaranteed syntactic or semantic basis.
Huang et al. [6] used a dynamic programming algorithm to obtain patterns on parts-of-speech and surface words for protein-protein interactions. On sentences that include keywords, their precision was 80.5% and recall was 80.0%. The system requires a training corpus in which sentences are aligned in order to estimate parameters.
Besides the biomedical domain, there are works that acquire extraction rules automatically in other domains. EXDISCO by Yangarber et al. [22] identifies a set of relevant documents and a set of extraction rules from un-annotated text, starting from a small set of seed rules. The rules are constructed on the results of a general-purpose dependency parser. Their result was 73% precision and 57% recall on the MUC-6 corpus [11]. Sudo et al. [16] acquired extraction rules as subtrees derived from dependency trees of sentences in automatically retrieved un-annotated documents. Their result was about 75% precision and 55% recall on the activate
Similar resources
Automatic Construction of Predicate-argument Structure Patterns for Biomedical Information Extraction
This paper presents a method of automatically constructing information extraction patterns on predicate-argument structures (PASs) obtained by full parsing from a smaller training corpus. Because PASs represent generalized structures for syntactical variants, patterns on PASs are expected to be more generalized than those on surface words. In addition, patterns are divided into components to im...
Open Information Extraction from Biomedical Literature Using Predicate-Argument Structure Patterns
In this paper, we propose an open information extraction (Open IE) system, which attempts to extract relations (or facts) of any type from biomedical literature. What distinguishes our system from existing Open IE systems is that it uses predicate-argument structure patterns to detect the candidates of possible biomedical facts. We have manually evaluated the output of our system and found that ...
Finding Anchor Verbs for Biomedical IE Using Predicate-Argument Structures
For biomedical information extraction, most systems use syntactic patterns on verbs (anchor verbs) and their arguments. Anchor verbs can be selected by focusing on their arguments. We propose to use predicate-argument structures (PASs), which are outputs of a full parser, to obtain verbs and their arguments. In this paper, we evaluated PAS method by comparing it to a method using part of speech...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Utilizing Various Natural Language Processing Techniques for Biomedical Interaction Extraction
The vast amount of biomedical literature is an important source for discovering biomedical interaction information. However, it is complicated to obtain interaction information from it because most of it is not easily readable by machine. In this paper, we present a method for extracting biomedical interaction information assuming that the biomedical Named Entities (NEs) are already identifie...
Integrated Annotation For Biomedical Information Extraction
We describe an approach to two areas of biomedical information extraction, drug development and cancer genomics. We have developed a framework which includes corpus annotation integrated at multiple levels: a Treebank containing syntactic structure, a Propbank containing predicate-argument structure, and annotation of entities and relations among the entities. Crucial to this approach is the pr...
Publication date: 2005